Classification using high-dimensional features arises frequently
in many contemporary statistical studies such as tumor classifica- 
tion using microarray or other high-throughput data. The impact of 
dimensionality on classifications is poorly understood. In a seminal 
paper, Bickel and Levina [Bernoulli 10 (2004) 989-1010] show that 
the Fisher discriminant performs poorly due to diverging spectra and 
they propose to use the independence rule to overcome the problem. 
We first demonstrate that even for the independence classification
rule, classification using all the features can be as poor as random
guessing due to noise accumulation in estimating population
centroids in high-dimensional feature space. In fact, we demonstrate
further that almost all linear discriminants can perform as poorly as
random guessing. Thus, it is important to select a subset of
important features for high-dimensional classification, resulting in
Features Annealed Independence Rules (FAIR). The conditions under
which all the important features can be selected by the two-sample
t-statistic are established. The choice of the optimal number of
features, or equivalently, the threshold value of the test statistics, is
proposed based on an upper bound of the classification error.
Simulation studies and real data analysis support our theoretical results
and demonstrate convincingly the advantage of our new classification 
procedure. 



1. Introduction. With rapid advance of imaging technology, high-through- 
put data such as microarray and proteomics data are frequently seen in many 
contemporary statistical studies. For instance, in the analysis of microarray
data, the dimensionality is frequently thousands or more, while the sample
size is typically on the order of tens [West et al. (2001) and Dudoit, Fridlyand
and Speed (2002)]; see Fan and Ren (2006) for an overview. The large
number of features presents an intrinsic challenge to classification problems. For
an overview of statistical challenges associated with high dimensionality, see 
Fan and Li (2006). 

Classical methods of classification break down when the dimensionality 
is extremely large. For example, even when the covariance matrix is known, 
Bickel and Levina (2004) demonstrate convincingly that the Fisher discrimi- 
nant analysis performs poorly in a minimax sense due to the diverging spec- 
tra (e.g., the condition number goes to infinity as dimensionality diverges) 
frequently encountered in the high-dimensional covariance matrices. Even if 
the true covariance matrix is not ill conditioned, the singularity of the sam- 
ple covariance matrix will make the Fisher discrimination rule inapplicable 
when the dimensionality is larger than sample size. Bickel and Levina (2004) 
show that the independence rule overcomes the above two problems.
However, in tumor classification using microarray data, we hope to find tens of
genes that have high discriminative power. The independence rule, studied
by Bickel and Levina (2004), does not possess this property.

The difficulty of high-dimensional classification is intrinsically caused by 
the existence of many noise features that do not contribute to the reduction 
of misclassification rate. Though the importance of dimension reduction and 
feature selection has been stressed and many methods have been proposed 
in the literature, very little research has been done on theoretical analysis of 
the impacts of high dimensionality on classification. For example, using most 
discrimination rules such as the linear discriminants, we need to estimate 
the population mean vectors from the sample. When the dimensionality is 
high, even though each component of the population mean vectors can be 
estimated with accuracy, the aggregated estimation error can be very large 
and this has adverse effects on the misclassification rate. Therefore, when 
there is only a fraction of features that account for most of the variation in 
the data such as tumor classification using gene expression data, using all 
features will increase the misclassification rate. 

To illustrate the idea, we study the independence classification rule.
Specifically, we give an explicit formula on how the signal and noise affect the
misclassification rates. We show formally how large the signal to noise
ratio can be such that the effect of noise accumulation can be ignored, and
how small this ratio can be before the independence classifier performs as
poorly as random guessing. Indeed, as demonstrated in Section 2, the
impact of the dimensionality can be very drastic. For the independence rule,
the misclassification rate can be as high as that of random guessing even when
the problem is perfectly classifiable. In fact, we demonstrate that almost all
linear discriminants cannot perform any better than random guessing, due
to the noise accumulation in the estimation of the population mean vectors,
unless the signals are very strong, namely the population mean vectors are
very far apart.

The above discussion reveals that feature selection is necessary for
high-dimensional classification problems. When the independence rule is applied
to selected features, the resulting Features Annealed Independence Rules (FAIR)
overcome both the issues of interpretability and the noise accumulation.
One can extract the important features via variable selection techniques 
such as the penalized quasi-likelihood function. See Fan and Li (2006) for 
an overview. One can also employ a simple two-sample t-test as in Tibshi- 
rani et al. (2002) to identify important genes for the tumor classification, 
resulting in the nearest shrunken centroids method. Such a simple method 
corresponds to a componentwise regression method or a ridge regression 
method with ridge parameters tending to $\infty$ [Fan and Lv (2007)]. Hence,
it is a specific and useful example of the penalized quasi-likelihood method 
for feature selection. It is surprising that such a simple proposal can indeed 
extract all important features. Indeed, we demonstrate that under suitable 
conditions, the two-sample t-statistic can identify all the features that effi- 
ciently characterize both classes. 

Another popular class of the dimension reduction methods is projection. 
They have been widely applied to the classification based on the gene expres- 
sion data. See, for example, principal component analysis in Ghosh (2002), 
Zou, Hastie and Tibshirani (2004) and Bair et al. (2006); partial least squares 
in Nguyen and Rocke (2002), Huang and Pan (2003) and Boulesteix (2004); 
and sliced inverse regression in Chiaromonte and Martinelli (2002), Anto- 
niadis, Lambert-Lacroix and Leblanc (2003) and Bura and Pfeiffer (2003). 
These projection methods attempt to find directions that can result in small 
classification errors. In fact, the directions found by these methods usually 
put much more weight on features that have large classification power. In 
general, however, linear projection methods are likely to perform poorly 
unless the projection vector is sparse, namely, the effective number of
selected features is small. This is due to the aforementioned noise accumulation
prominently featured in high-dimensional problems: recall that discrimination
based on linear projections onto almost all directions can perform as poorly
as random guessing.

As direct application of the independence rule is not efficient, we propose
a specific form of FAIR. Our FAIR selects the statistically most significant
m features according to the componentwise two-sample t-statistics between
two classes, and applies the independence classifier to these m features.
Interesting questions include how to choose the optimal m, or equivalently,
the threshold value of the t-statistic, such that the classification error is
minimized, and how this classifier performs compared with the independence
rule without feature selection and the oracle-assisted FAIR. All these
questions will be formally answered in this paper. Surprisingly, these results are
similar to those for the adaptive Neyman test in Fan (1996). The theoretical
results also indicate that FAIR without oracle information performs worse
than the one with oracle information, and the difference of classification
error depends on the threshold value, which is consistent with common
sense.

There is a huge literature on classification. To name a few in addition
to those mentioned before, Bai and Saranadasa (1996) dealt with the effect of
high dimensionality in a two-sample problem from a hypothesis testing
viewpoint; Friedman (1989) proposed a regularized discriminant analysis to deal
with the problems associated with high dimension while performing
computations in the regular way; Dettling and Bühlmann (2003) and Bühlmann
and Yu (2003) study boosting with logit loss and L2 loss, respectively, and
demonstrate the good performances of these methods in high-dimensional 
setting; Greenshtein and Ritov (2004), Greenshtein (2006) and Meinshausen 
(2007) introduced and studied the concept of persistence, which places more 
emphasis on misclassification rates or expected loss rather than the accuracy 
of estimated parameters. 

This article is organized as follows. In Section 2, we demonstrate the 
impact of dimensionality on the independence classification rule, and show 
that discrimination based on projecting observations onto almost all linear 
directions is nearly the same as random guessing. We establish, in Section 3, 
the conditions under which the two-sample t-test can identify all the important
features with probability tending to 1. In Section 4, we propose FAIR and 
give an upper bound of its classification error. Simulation studies and real 
data analyses are conducted in Section 5. The conclusion of our study is 
summarized in Section 6. All proofs are given in the Appendix. 

2. Impact of high dimensionality. Consider the $p$-dimensional classification
problem between two classes $C_1$ and $C_2$. Suppose that from class $C_k$, we
have $n_k$ observations $\mathbf{Y}_{k1}, \ldots, \mathbf{Y}_{kn_k}$ in $\mathbb{R}^p$. The $j$th feature of the
$i$th sample from class $C_k$ satisfies the model

$$Y_{kij} = \mu_{kj} + \varepsilon_{kij}, \qquad k = 1, 2,\ i = 1, \ldots, n_k,\ j = 1, \ldots, p, \tag{2.1}$$

where $\mu_{kj}$ is the mean effect of the $j$th feature in class $C_k$ and $\varepsilon_{kij}$ is the
corresponding Gaussian random noise for the $i$th observation. In matrix notation,
the above model can be written as

$$\mathbf{Y}_{ki} = \boldsymbol{\mu}_k + \boldsymbol{\varepsilon}_{ki}, \qquad k = 1, 2,\ i = 1, \ldots, n_k,$$

where $\boldsymbol{\mu}_k = (\mu_{k1}, \ldots, \mu_{kp})'$ is the mean vector of class $C_k$ and
$\boldsymbol{\varepsilon}_{ki} = (\varepsilon_{ki1}, \ldots, \varepsilon_{kip})'$ has the distribution $N(\mathbf{0}, \boldsymbol{\Sigma}_k)$. We assume that all
observations are independent across samples and in addition, within class $C_k$,
observations $\mathbf{Y}_{k1}, \ldots, \mathbf{Y}_{kn_k}$ are also identically distributed. Throughout this
paper, we make the assumption that the two classes have compatible sample sizes,
that is, $c_1 < n_1/n_2 < c_2$ with $c_1$ and $c_2$ some positive constants.

We first investigate the impact of high dimensionality on classification.
For simplicity, we temporarily assume that the two classes $C_1$ and $C_2$ have
the same covariance matrix $\boldsymbol{\Sigma}$. To illustrate our idea, we consider the
independence classification rule, which classifies the new feature vector $\mathbf{x}$ into
class $C_1$ if

$$\delta(\mathbf{x}) = (\mathbf{x} - \boldsymbol{\mu})' \mathbf{D}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) > 0,$$

where $\boldsymbol{\mu} = (\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)/2$ and $\mathbf{D} = \mathrm{diag}(\boldsymbol{\Sigma})$. This classifier has been thoroughly
studied in Bickel and Levina (2004). They showed that in the classification 
of two normal populations, this independence rule greatly outperforms the 
Fisher linear discriminant rule under broad conditions when the number of 
variables is large. 

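The independence rule depends on the marginal parameters $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$ and
$\mathbf{D} = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_p^2\}$. They can easily be estimated from the samples

$$\hat{\boldsymbol{\mu}}_k = \sum_{i=1}^{n_k} \mathbf{Y}_{ki}/n_k, \quad k = 1, 2, \qquad \hat{\boldsymbol{\mu}} = (\hat{\boldsymbol{\mu}}_1 + \hat{\boldsymbol{\mu}}_2)/2$$

and

$$\hat{\mathbf{D}} = \mathrm{diag}\{(S_{1j}^2 + S_{2j}^2)/2,\ j = 1, \ldots, p\},$$

where $S_{kj}^2 = \sum_{i=1}^{n_k} (Y_{kij} - \bar{Y}_{kj})^2/(n_k - 1)$ is the sample variance of the $j$th
feature in class $k$ and $\bar{Y}_{kj} = \sum_{i=1}^{n_k} Y_{kij}/n_k$. Hence, the plug-in discrimination
function is

$$\hat\delta(\mathbf{x}) = (\mathbf{x} - \hat{\boldsymbol{\mu}})' \hat{\mathbf{D}}^{-1} (\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2).$$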
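The plug-in rule above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `fit_independence_rule` and `independence_rule` are hypothetical and do not come from the paper, and the training samples are assumed to be stored as arrays with one observation per row.

```python
import numpy as np

def fit_independence_rule(Y1, Y2):
    """Estimate the plug-in quantities of the independence rule.

    Y1, Y2: arrays of shape (n_k, p) holding the training samples of
    classes C_1 and C_2 (rows are observations).
    """
    mu1, mu2 = Y1.mean(axis=0), Y2.mean(axis=0)
    mu_bar = (mu1 + mu2) / 2.0
    # Pooled componentwise variance (S_1j^2 + S_2j^2) / 2; ddof=1 matches
    # the (n_k - 1) divisor of the sample variance in the text.
    d = (Y1.var(axis=0, ddof=1) + Y2.var(axis=0, ddof=1)) / 2.0
    return mu1, mu2, mu_bar, d

def independence_rule(x, mu1, mu2, mu_bar, d):
    """Return delta_hat(x) per row of x; positive values classify into C_1."""
    return (np.atleast_2d(x) - mu_bar) @ ((mu1 - mu2) / d)
```

Because the diagonal matrix is stored as the vector `d`, applying its inverse is a componentwise division, so no $p \times p$ matrix is ever formed.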

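Denote the parameter by $\theta = (\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$. If we have a new observation $\mathbf{X}$
from class $C_1$, then the misclassification rate of $\hat\delta$ is

$$W(\hat\delta, \theta) = P(\hat\delta(\mathbf{X}) < 0 \mid \mathbf{Y}_{ki},\ i = 1, \ldots, n_k,\ k = 1, 2) = 1 - \Phi(\Psi), \tag{2.2}$$

where

$$\Psi = \frac{(\boldsymbol{\mu}_1 - \hat{\boldsymbol{\mu}})' \hat{\mathbf{D}}^{-1} (\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2)}{\sqrt{(\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2)' \hat{\mathbf{D}}^{-1} \boldsymbol{\Sigma} \hat{\mathbf{D}}^{-1} (\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2)}},$$

and $\Phi(\cdot)$ is the standard Gaussian distribution function. The worst case
classification error is

$$W(\hat\delta) = \max_{\theta \in \Gamma} W(\hat\delta, \theta),$$

where $\Gamma$ is some parameter space to be defined. Let $n = n_1 + n_2$. In our
asymptotic analysis, we always consider the misclassification rate of
observations from $C_1$, since the misclassification rate of observations from $C_2$ can
be easily obtained by interchanging $n_1$ with $n_2$ and $\boldsymbol{\mu}_1$ with $\boldsymbol{\mu}_2$. The high
dimensionality is modeled through its dependence on $n$, namely $p_n \to \infty$.
However, we will suppress its dependence on $n$ whenever there is no
confusion.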
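Conditional on the training data, the plug-in score of a new observation from $C_1$ is Gaussian, so the rate in (2.2) can be evaluated in closed form once the true parameters are known (as in a simulation). The sketch below assumes exactly that setting; `conditional_error` and `std_normal_cdf` are hypothetical helper names, and the diagonal of the estimated variance matrix is passed as the vector `d_hat`.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Standard Gaussian distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conditional_error(mu1_true, Sigma, mu1_hat, mu2_hat, d_hat):
    """Evaluate 1 - Phi(Psi) for a new X ~ N(mu1_true, Sigma)."""
    w = (mu1_hat - mu2_hat) / d_hat              # D_hat^{-1} (mu1_hat - mu2_hat)
    mu_bar_hat = (mu1_hat + mu2_hat) / 2.0
    # Psi = conditional mean of the score divided by its standard deviation.
    psi = (mu1_true - mu_bar_hat) @ w / math.sqrt(w @ Sigma @ w)
    return 1.0 - std_normal_cdf(psi)
```

In one dimension with true mean 1, unit variance and estimated centroids at 1 and -1, the decision boundary is 0 and the function returns $1 - \Phi(1)$.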

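Let $\mathbf{R} = \mathbf{D}^{-1/2} \boldsymbol{\Sigma} \mathbf{D}^{-1/2}$ be the correlation matrix, and $\lambda_{\max}(\mathbf{R})$ be its
largest eigenvalue, and $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_p)' = \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$. Consider the parameter
space

$$\Gamma = \Big\{(\boldsymbol{\alpha}, \boldsymbol{\Sigma}) : \boldsymbol{\alpha}' \mathbf{D}^{-1} \boldsymbol{\alpha} \ge C_p,\ \lambda_{\max}(\mathbf{R}) \le b_0,\ \min_{1 \le j \le p} \sigma_j^2 > 0\Big\},$$

where $C_p$ is a deterministic positive sequence that depends only on the
dimensionality $p$, and $b_0$ is a positive constant. Note that $\boldsymbol{\alpha}' \mathbf{D}^{-1} \boldsymbol{\alpha}$ corresponds
to the overall strength of signals, and the first condition $\boldsymbol{\alpha}' \mathbf{D}^{-1} \boldsymbol{\alpha} \ge C_p$
imposes a lower bound on the strength of signals. The second condition
$\lambda_{\max}(\mathbf{R}) \le b_0$ requires that the maximum eigenvalue of $\mathbf{R}$ should not
exceed a positive constant. But since there are no restrictions on the smallest
eigenvalue of $\mathbf{R}$, the condition number can still diverge. The third condition
$\min_{1 \le j \le p} \sigma_j^2 > 0$ ensures that there are no deterministic features that
make classification trivial and the diagonal matrix $\mathbf{D}$ is always invertible.
We will consider the asymptotic behavior of $W(\hat\delta, \theta)$ and $W(\hat\delta)$.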

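Theorem 1. Suppose that $\log p = o(n)$, $n = o(p)$ and $n C_p \to \infty$. Then:
(i) The classification error $W(\hat\delta, \theta)$ with $\theta \in \Gamma$ is bounded from above as

$$W(\hat\delta, \theta) \le 1 - \Phi\left(\frac{[n_1 n_2/(pn)]^{1/2}\, \boldsymbol{\alpha}' \mathbf{D}^{-1} \boldsymbol{\alpha}\, (1 + o_P(1)) + \sqrt{p/(n n_1 n_2)}\, (n_1 - n_2)}{2 \sqrt{\lambda_{\max}(\mathbf{R})}\, \{1 + n_1 n_2/(pn)\, \boldsymbol{\alpha}' \mathbf{D}^{-1} \boldsymbol{\alpha}\, (1 + o_P(1))\}^{1/2}}\right).$$

(ii) Suppose $p/(n C_p) \to 0$. For the worst case classification error $W(\hat\delta)$,
we have

$$W(\hat\delta) = 1 - \Phi\Big(\tfrac{1}{2} [n_1 n_2/(p n b_0)]^{1/2} C_p \{1 + o_P(1)\}\Big).$$

Specifically, when $\{n_1 n_2/(pn)\}^{1/2} C_p \to C_0$ with $C_0$ a nonnegative constant, then

$$W(\hat\delta) \to_P 1 - \Phi\Big(\frac{C_0}{2\sqrt{b_0}}\Big);$$

in particular, if $C_0 = 0$, then $W(\hat\delta) \to_P \tfrac{1}{2}$.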

Theorem 1 reveals the trade-off between the signal strength $C_p$ and the
dimensionality, reflected in the term $C_p/\sqrt{p}$ when all features are used for
classification. It states that the independence rule $\hat\delta$ would be no better than
random guessing due to noise accumulation, unless the signal levels are
extremely high, say, $\{n/p\}^{1/2} C_p > B$ for some $B > 0$. Indeed, discrimination
based on linear projections to almost all directions performs nearly the same
as random guessing, as shown in the theorem below. The poor performance
is caused by noise accumulation in the estimation of $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$.
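Theorem 2. Suppose that $\mathbf{a}$ is a $p$-dimensional uniformly distributed
unit random vector on a $(p - 1)$-dimensional sphere. Let $\lambda_1, \ldots, \lambda_p$ be the
eigenvalues of the covariance matrix $\boldsymbol{\Sigma}$. Suppose $\lim_{p \to \infty} p^{-2} \sum_{j=1}^p \lambda_j^2 < \infty$ and
$\lim_{p \to \infty} p^{-1} \sum_{j=1}^p \lambda_j = \tau$ with $\tau$ a positive constant. Moreover, assume that
$p^{-1} \boldsymbol{\alpha}' \boldsymbol{\alpha} \to 0$. Then if we project all the observations onto the vector $\mathbf{a}$ and
use the classifier

$$\hat\delta_{\mathbf{a}}(\mathbf{x}) = (\mathbf{a}'\mathbf{x} - \mathbf{a}'\hat{\boldsymbol{\mu}})(\mathbf{a}'\hat{\boldsymbol{\mu}}_1 - \mathbf{a}'\hat{\boldsymbol{\mu}}_2), \tag{2.3}$$

the misclassification rate of $\hat\delta_{\mathbf{a}}$ satisfies

$$P(\hat\delta_{\mathbf{a}}(\mathbf{X}) < 0 \mid \mathbf{Y}_{ki},\ i = 1, \ldots, n_k,\ k = 1, 2) \to_P \tfrac{1}{2},$$

where the probability is taken with respect to $\mathbf{a}$ and $\mathbf{X} \in C_1$.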
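The statement of Theorem 2 is easy to probe numerically. The following is a minimal sketch, not the authors' simulation code: `random_projection_error` is a hypothetical name, the direction is drawn by normalizing a standard Gaussian vector (which is uniform on the unit sphere), and the error is estimated on test points that the caller draws from $C_1$.

```python
import numpy as np

def random_projection_error(Y1, Y2, X_test, rng):
    """Empirical misclassification rate of rule (2.3) on test points from C_1.

    Y1, Y2: training samples, shape (n_k, p); X_test: shape (n_test, p);
    rng: a numpy.random.Generator used to draw the projection direction.
    """
    p = Y1.shape[1]
    a = rng.standard_normal(p)
    a /= np.linalg.norm(a)                   # uniform direction on the unit sphere
    m1, m2 = (Y1 @ a).mean(), (Y2 @ a).mean()  # projected class centroids
    scores = (X_test @ a - (m1 + m2) / 2.0) * (m1 - m2)
    return float(np.mean(scores < 0))        # a C_1 point is misclassified if score < 0
```

Averaging the returned rate over many independent draws of the data and the direction should give a value near 1/2 when the mean difference is small relative to the dimension, in line with the theorem.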



3. Feature selection by two-sample t-test. To extract salient features,
we appeal to the two-sample t-test statistics. Other componentwise tests
such as the rank sum test can also be used, but we do not pursue those in
detail. The two-sample t-statistic for feature $j$ is defined as

$$T_j = \frac{\bar{Y}_{1j} - \bar{Y}_{2j}}{\sqrt{S_{1j}^2/n_1 + S_{2j}^2/n_2}}, \qquad j = 1, \ldots, p, \tag{3.1}$$

where $\bar{Y}_{kj}$ and $S_{kj}^2$ are the same as those defined in Section 2. We work under
more relaxed technical conditions: the normality assumption is not needed.
Instead, we assume merely that the noise vectors $\boldsymbol{\varepsilon}_{ki}$, $i = 1, \ldots, n_k$, are i.i.d.
within class $C_k$ with mean $\mathbf{0}$ and covariance matrix $\boldsymbol{\Sigma}_k$, and are independent
between classes. The covariance matrix $\boldsymbol{\Sigma}_1$ can also differ from $\boldsymbol{\Sigma}_2$.
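As a concrete illustration of (3.1), the statistics for all $p$ features can be computed at once. This is a small sketch with hypothetical function names (`two_sample_t`, `select_by_threshold`); it assumes each class is stored as an array with observations in rows and that no feature has zero sample variance in both classes.

```python
import numpy as np

def two_sample_t(Y1, Y2):
    """Componentwise two-sample t-statistics (3.1); Y1, Y2 have shape (n_k, p)."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    diff = Y1.mean(axis=0) - Y2.mean(axis=0)
    # Unequal-variance standard error, using sample variances with n_k - 1.
    se = np.sqrt(Y1.var(axis=0, ddof=1) / n1 + Y2.var(axis=0, ddof=1) / n2)
    return diff / se

def select_by_threshold(T, x):
    """Indices j with |T_j| >= x, i.e. the features kept at critical value x."""
    return np.flatnonzero(np.abs(T) >= x)
```

Sorting features by `-np.abs(T)` gives the ranking by significance that is used when a fixed number of features, rather than a critical value, is specified.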

To show that the t-statistic can select all the important features with
probability tending to 1, we need the following condition.



Condition 1. 

(a) Assume that the vector $\boldsymbol{\alpha} = \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$ is sparse and without loss of
generality, only the first $s$ entries are nonzero.

(b) Suppose that $\varepsilon_{kij}$ and $\varepsilon_{kij}^2 - 1$ satisfy Cramér's condition, that is,
there exist constants $\nu_1$, $\nu_2$, $M_1$ and $M_2$, such that $E|\varepsilon_{kij}|^m \le m!\, M_1^{m-2} \nu_1/2$
and $E|\varepsilon_{kij}^2 - \sigma_{kj}^2|^m \le m!\, M_2^{m-2} \nu_2/2$ for all $m = 1, 2, \ldots$.

(c) Assume that the diagonal elements of both $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ are bounded
away from 0.



The following theorem describes the situation under which the two-sample
t-test can pick up all important features by choosing an appropriate critical
value. Recall that $c_1 < n_1/n_2 < c_2$ and $n = n_1 + n_2$.

Theorem 3. Let $s$ be a sequence such that $\log(p - s) = o(n^{\gamma})$ and
$\log s = o(n^{1/2 - \gamma} \beta_n)$ for some $\beta_n \to \infty$ and $0 < \gamma < \tfrac{1}{3}$. Suppose that

$$\min_{1 \le j \le s} \frac{|\alpha_j|}{\sqrt{\sigma_{1j}^2 + \sigma_{2j}^2}} = n^{-\gamma} \beta_n.$$

Then under Condition 1, for $x \sim c n^{\gamma/2}$ with $c$ some positive constant,
we have

$$P\Big(\min_{j \le s} |T_j| \ge x \ \text{and}\ \max_{j > s} |T_j| < x\Big) \to 1.$$

In the proof of Theorem 3, we used the moderate deviation results of the 
two-sample t-statistic [see Cao (2007) or Shao (2005)]. Theorem 3 allows the 
lowest signal level to decay with sample size n. As long as the rate of decay 
is not too fast and the sample size is not too small, the two-sample t-test
can pick up all the important features with probability tending to 1. 

4. Features annealed independence rules. We apply the independence 
classifier to the selected features, resulting in a Features Annealed Indepen- 
dence Rule (FAIR). In many applications such as tumor classification using 
gene expression data, we would expect that elements in the population mean 
difference vector a are sparse: most entries are small. Thus, even if we could 
use the t-test to correctly extract all these features, the resulting choice is
not necessarily optimal, since the noise accumulation can even exceed the 
signal accumulation for faint features. This can be seen from Theorem 1. 
Therefore, it is necessary to further single out the most important features 
that help reduce misclassification rate. 

To help us select the number of features, or the critical value of the test 
statistic, we first consider the ideal situation that the important features 
are located at the first m coordinates and our task is to merely select m 
to minimize the misclassification rate. This is the case when we have the 
ideal information about the relative importance of features, as measured by 
$|\alpha_j|/\sigma_j$, say. When such oracle information is unavailable, we will learn
it from the data. In the situation that we have vague knowledge about the
importance of features such as tumor classification using gene expression
data, we can give high ranks to features with large $|\alpha_j|/\sigma_j$.

In the presentation below, unless otherwise specified, we assume that the
two classes $C_1$ and $C_2$ are both from Gaussian distributions and the common
covariance matrix is the identity, that is, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \mathbf{I}$. If this common
covariance matrix is known, the independence classifier $\hat\delta$ becomes the nearest
centroids classifier

$$\hat\delta_{\mathrm{NC}}(\mathbf{x}) = (\mathbf{x} - \hat{\boldsymbol{\mu}})' (\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2).$$

If only the first $m$ dimensions are used in the classification, the corresponding
features annealed independence classifier becomes

$$\hat\delta^{m}_{\mathrm{NC}}(\mathbf{x}) = (\mathbf{x}^m - \hat{\boldsymbol{\mu}}^m)' (\hat{\boldsymbol{\mu}}_1^m - \hat{\boldsymbol{\mu}}_2^m),$$

where the superscript $m$ means that the vector is truncated after the first $m$
entries. This is indeed the same as the nearest shrunken centroids method
of Tibshirani et al. (2002).



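Theorem 4. Consider the truncated classifier $\hat\delta^{m_n}_{\mathrm{NC}}$ for a given sequence
$m_n$. Suppose that $\frac{n}{\sqrt{m_n}} \sum_{j=1}^{m_n} \alpha_j^2 \to \infty$ as $m_n \to \infty$. Then the classification
error of $\hat\delta^{m_n}_{\mathrm{NC}}$ is

$$W(\hat\delta^{m_n}_{\mathrm{NC}}, \theta) = 1 - \Phi\left(\frac{(1 + o_P(1)) \sum_{j=1}^{m_n} \alpha_j^2 + m_n (n_1 - n_2)/(n_1 n_2)}{2 \{(1 + o_P(1)) \sum_{j=1}^{m_n} \alpha_j^2 + n m_n/(n_1 n_2)\}^{1/2}}\right),$$

where $n = n_1 + n_2$ as defined in Section 2.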

In the following, we suppress the dependence of $m$ on $n$ when there is no
confusion. The above theorem reveals that the ideal choice on the number
of features is

$$m_0 = \arg\max_{1 \le m \le p} \frac{[\sum_{j=1}^m \alpha_j^2 + m(n_1 - n_2)/(n_1 n_2)]^2}{nm/(n_1 n_2) + \sum_{j=1}^m \alpha_j^2}.$$

It can be estimated as

$$\hat{m}_0 = \arg\max_{1 \le m \le p} \frac{[\sum_{j=1}^m \hat\alpha_j^2 + m(n_1 - n_2)/(n_1 n_2)]^2}{nm/(n_1 n_2) + \sum_{j=1}^m \hat\alpha_j^2},$$

where $\hat\alpha_j = \hat\mu_{1j} - \hat\mu_{2j}$. The expression for $m_0$ quantifies how the
signal and the noise affect the misclassification rates as the dimensionality
$m$ increases. In particular, when $n_1 = n_2$, the expression reduces to

$$m_0 = \arg\max_{1 \le m \le p} \frac{m^{-1} (\sum_{j=1}^m \alpha_j^2)^2}{4/n + m^{-1} \sum_{j=1}^m \alpha_j^2}.$$

The term $m^{-1/2} \sum_{j=1}^m \alpha_j^2$ reflects the trade-off
between the signal and noise as dimensionality $m$ increases.
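The estimated choice of the number of features can be computed with cumulative sums. The sketch below is illustrative only: `estimate_m0` is a hypothetical name, and the input is assumed to hold the estimated mean differences already arranged so that the features believed to be most important come first.

```python
import numpy as np

def estimate_m0(alpha_hat, n1, n2):
    """Maximize the estimated criterion for the truncated classifier over m.

    alpha_hat: estimated mean differences, ordered by presumed importance.
    Returns the maximizing m in 1, ..., p.
    """
    n = n1 + n2
    m = np.arange(1, alpha_hat.size + 1)
    signal = np.cumsum(np.asarray(alpha_hat, dtype=float) ** 2)
    num = (signal + m * (n1 - n2) / (n1 * n2)) ** 2
    den = n * m / (n1 * n2) + signal
    return int(m[np.argmax(num / den)])
```

For instance, with two unit-size mean differences followed by zeros and $n_1 = n_2 = 30$, the criterion equals $1/(1 + 1/15)$ at $m = 1$, $4/(2 + 2/15)$ at $m = 2$ and $4/(2 + 3/15)$ at $m = 3$, so the maximizer is $m = 2$: adding the noise features only inflates the denominator.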

The good performance of the classifier $\hat\delta^{m}_{\mathrm{NC}}$ depends on the assumption
that the largest entries of $\boldsymbol{\alpha}$ cluster at the first $m$ dimensions. An ideal
version of the classifier $\hat\delta_{\mathrm{NC}}$ is to select a subset $A = \{j : |\alpha_j| > a\}$ and use
this subset to construct the independence classifier. Let $m$ be the number of
elements in $A$. The oracle classifier can be written as

$$\hat\delta_{\mathrm{orc}}(\mathbf{x}) = \sum_{j=1}^p \hat\alpha_j (x_j - \hat\mu_j) 1\{|\alpha_j| > a\}.$$

The misclassification rate is approximately

$$1 - \Phi\left(\frac{\sum_{j \in A} \alpha_j^2 + m(n_1 - n_2)/(n_1 n_2)}{2 \{nm/(n_1 n_2) + \sum_{j \in A} \alpha_j^2\}^{1/2}}\right) \tag{4.1}$$

when $\frac{n}{\sqrt{m}} \sum_{j \in A} \alpha_j^2 \to \infty$ and $m \to \infty$. This is straightforward from Theorem
4. In practice, we do not have such an oracle, and selecting the subset $A$
is difficult. A simple procedure is to use the features annealed independence
rule based on hard thresholding:

$$\hat\delta^{b}_{\mathrm{FAIR}}(\mathbf{x}) = \sum_{j=1}^p \hat\alpha_j (x_j - \hat\mu_j) 1\{|\hat\alpha_j| > b\}.$$

We study the classification error of FAIR and the impact of the threshold b 
on the classification result in the following theorem. 

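Theorem 5. Suppose that $\max_{j \in A^c} |\alpha_j| < b_n$ and
$\log(p - m)/[n (b_n - \max_{j \in A^c} |\alpha_j|)^2] \to 0$ with $m = |A|$. Moreover, assume that
$\frac{n}{\sqrt{m}} \sum_{j \in A} \alpha_j^2 \to \infty$ and $\sum_{j \in A} |\alpha_j| / [\sqrt{n} \sum_{j \in A} \alpha_j^2] \to 0$. Then

$$W(\hat\delta^{b_n}_{\mathrm{FAIR}}, \theta) \le 1 - \Phi\left(\frac{(1 + o_P(1)) \sum_{j \in A} \alpha_j^2 + nm(n_1 n_2)^{-1} - m b_n^2}{2 \{(1 + o_P(1)) \sum_{j \in A} \alpha_j^2 + nm(n_1 n_2)^{-1}\}^{1/2}}\right).$$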

Notice that the upper bound of $W(\hat\delta^{b_n}_{\mathrm{FAIR}}, \theta)$ in Theorem 5 is greater
than the classification error in Theorem 4, and the magnitude of difference
depends on $m b_n^2$. This is expected as estimating the set $A$ increases the
classification error. These results are similar to those in Fan (1996) for high- 
dimensional hypothesis testing. 

When the common covariance matrix is different from the identity, FAIR
takes a slightly different form to adapt to the unknown componentwise
variance:

$$\hat\delta_{\mathrm{FAIR}}(\mathbf{x}) = \sum_{j=1}^p \hat\alpha_j (x_j - \hat\mu_j)/\hat\sigma_j^2 \; 1\{\sqrt{n/(n_1 n_2)}\, |T_j| > b\}, \tag{4.2}$$

where $T_j$ is the two-sample t-statistic. It is clear from (4.2) that FAIR works
as follows: we first sort the features by the absolute values of their
t-statistics in descending order, and then take the first $m$ features
to classify the data. The number of features can be selected by minimizing
the upper bound of the classification error given in Theorem 1. The optimal
$m$ in this sense is

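$$m_1 = \arg\max_{1 \le m \le p} \frac{1}{\lambda^m_{\max}} \frac{[\sum_{j=1}^m \alpha_j^2/\sigma_j^2 + m(1/n_2 - 1/n_1)]^2}{nm/(n_1 n_2) + \sum_{j=1}^m \alpha_j^2/\sigma_j^2},$$

where $\lambda^m_{\max}$ is the largest eigenvalue of the correlation matrix $\mathbf{R}^m$ of the
truncated observations. It can be estimated from the samples:

$$\hat{m}_1 = \arg\max_{1 \le m \le p} \frac{1}{\hat\lambda^m_{\max}} \frac{[\sum_{j=1}^m \hat\alpha_j^2/\hat\sigma_j^2 + m(1/n_2 - 1/n_1)]^2}{nm/(n_1 n_2) + \sum_{j=1}^m \hat\alpha_j^2/\hat\sigma_j^2} = \arg\max_{1 \le m \le p} \frac{1}{\hat\lambda^m_{\max}} \frac{n [\sum_{j=1}^m T_j^2 + m(n_1 - n_2)/n]^2}{m n_1 n_2 + n_1 n_2 \sum_{j=1}^m T_j^2}. \tag{4.3}$$

Note that the factor $\hat\lambda^m_{\max}$ in (4.3) increases with $m$, which makes $\hat{m}_1$ usually
smaller than $\hat{m}_0$.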
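Putting (3.1), (4.2) and (4.3) together, the whole procedure can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the names `fair_fit` and `fair_predict` are hypothetical, the cap `max_m` on the search range is an assumption added to keep the eigenvalue computations cheap (the text searches over all $1 \le m \le p$), and the largest eigenvalue is estimated from the sample correlation matrix of the class-centred training data, which is one natural choice that the text does not spell out.

```python
import numpy as np

def fair_fit(Y1, Y2, max_m=50):
    """Rank features by |T_j|, choose m by criterion (4.3), keep the top m."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    n = n1 + n2
    mu1, mu2 = Y1.mean(axis=0), Y2.mean(axis=0)
    s1, s2 = Y1.var(axis=0, ddof=1), Y2.var(axis=0, ddof=1)
    T = (mu1 - mu2) / np.sqrt(s1 / n1 + s2 / n2)   # t-statistics (3.1)
    order = np.argsort(-np.abs(T))                 # decreasing |T_j|
    top = order[:min(max_m, order.size)]

    # Assumption: sample correlation matrix of the centred training data,
    # restricted to the top-ranked features.
    Z = np.vstack([Y1 - mu1, Y2 - mu2])[:, top]
    R = np.atleast_2d(np.corrcoef(Z, rowvar=False))

    cum_T2 = np.cumsum(T[top] ** 2)
    best_m, best_val = 1, -np.inf
    for m in range(1, top.size + 1):
        lam = np.linalg.eigvalsh(R[:m, :m])[-1]    # largest eigenvalue of R^m
        num = n * (cum_T2[m - 1] + m * (n1 - n2) / n) ** 2
        den = lam * (m * n1 * n2 + n1 * n2 * cum_T2[m - 1])
        if num / den > best_val:
            best_m, best_val = m, num / den

    keep = top[:best_m]
    return {
        "features": keep,
        "alpha": (mu1 - mu2)[keep],
        "center": ((mu1 + mu2) / 2.0)[keep],
        "var": ((s1 + s2) / 2.0)[keep],
    }

def fair_predict(model, X):
    """Return class label 1 or 2 for each row of X using the rule (4.2)."""
    Xs = np.atleast_2d(X)[:, model["features"]]
    score = ((Xs - model["center"]) / model["var"]) @ model["alpha"]
    return np.where(score > 0, 1, 2)
```

Selecting the top $m$ features by $|T_j|$ is equivalent to thresholding $|T_j|$ at the $m$th largest value, so the sketch works with $m$ directly instead of the threshold $b$.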

5. Numerical studies. In this section we use a simulation study and three 
real data analyses to illustrate our theoretical results and to verify the per- 
formance of our newly proposed classifier FAIR. 

5.1. Simulation study. We first introduce the model. The covariance
matrices $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ for the two classes are chosen to be the same. For the
distribution of the error $\varepsilon_{ij}$ in (2.1), we use the same model as that in Fan, Hall
and Yao (2006). Specifically, features are divided into three groups. Within
each group, features share one unobservable common factor with different
factor loadings. In addition, there is an unobservable common factor among
all the features across three groups. For simplicity, we assume that the
number of features $p$ is a multiple of 3. Let $Z_{ij}$ be a sequence of independent
$N(0, 1)$ random variables, and $\chi_{ij}$ be a sequence of independent random
variables of the same distribution as $(\chi_d^2 - d)/\sqrt{2d}$ with $\chi_d^2$ the chi-square
distribution with $d$ degrees of freedom. In the simulation we set $d = 6$.

Let $\{a_j\}$ and $\{b_j\}$ be factor loading coefficients. Then the error in (2.1)
is defined as

$$\varepsilon_{ij} = \frac{Z_{ij} + a_{1j} \chi_{1i} + a_{2j} \chi_{2i} + a_{3j} \chi_{3i} + b_j \chi_{4i}}{(1 + a_{1j}^2 + a_{2j}^2 + a_{3j}^2 + b_j^2)^{1/2}}, \qquad i = 1, \ldots, n_k,\ j = 1, \ldots, p,$$

where $a_{ij} = 0$ except that $a_{1j} = a_j$ for $j = 1, \ldots, p/3$, $a_{2j} = a_j$ for
$j = (p/3) + 1, \ldots, 2p/3$, and $a_{3j} = a_j$ for $j = (2p/3) + 1, \ldots, p$. Therefore, $E\varepsilon_{ij} = 0$ and
$\mathrm{var}(\varepsilon_{ij}) = 1$, and in general, within group correlation is greater than the
between group correlation. The factor loadings $a_j$ and $b_j$ are independently
generated from uniform distributions $U(0, 0.4)$ and $U(0, 0.2)$. The mean
vector $\boldsymbol{\mu}_1$ for class $C_1$ is taken from a realization of the mixture of a point mass
at 0 and a double-exponential distribution:

$$(1 - c)\delta_0 + \tfrac{1}{2} c \exp(-2|x|),$$

where $c \in (0, 1)$ is a constant. In the simulation, we set $p = 4500$ and $c = 0.02$.
In other words, there are around 90 signal features on average, many of
which are weak signals. Without loss of generality, $\boldsymbol{\mu}_2$ is set to be $\mathbf{0}$. Figure 1
shows the true mean difference vector $\boldsymbol{\alpha}$, which is fixed across all simulations.
It is clear that there are only very few features with signal levels exceeding
1 standard deviation of the noise.
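The error model above can be simulated directly. The following sketch is an assumption-laden illustration: `simulate_errors` is a hypothetical name, features are assumed to be laid out in three consecutive blocks of equal size, and each observation receives four standardized chi-square factors (three group factors and one factor shared by all features).

```python
import numpy as np

def simulate_errors(n, a, b, d=6, rng=None):
    """Draw n error vectors from the three-group factor model.

    a, b: length-p arrays of factor loadings, with p a multiple of 3.
    Returns an array of shape (n, p) with mean 0 and unit variance per feature.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = a.size
    group = np.repeat(np.arange(3), p // 3)       # group label of each feature
    Z = rng.standard_normal((n, p))
    # Standardized chi-square factors: columns 0-2 are the group factors,
    # column 3 is the factor common to all features.
    chi = (rng.chisquare(d, size=(n, 4)) - d) / np.sqrt(2 * d)
    eps = Z + a * chi[:, group] + b * chi[:, [3]]
    # Only one a_{kj} is nonzero per feature, so the normalizer in the text
    # reduces to sqrt(1 + a_j^2 + b_j^2).
    return eps / np.sqrt(1.0 + a ** 2 + b ** 2)
```

With loadings drawn as `rng.uniform(0, 0.4, p)` and `rng.uniform(0, 0.2, p)`, features in the same group share two factors and features in different groups share one, which is why the within-group correlation is the larger of the two.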

With the parameters and model above, for each simulation, we generate
$n_1 = 30$ training data from class $C_1$ and $n_2 = 30$ training data from $C_2$. In
addition, separate 200 samples are generated from each of the two classes in
each simulation, and these 400 vectors are used as test samples. We apply
our newly proposed classifier FAIR to the simulated data. Specifically, for
each feature, the t-test statistic in (3.1) is calculated using the training
sample. Then the features are sorted in the decreasing order of the absolute
values of their t-statistics. We then examine the impact of the number of
features $m$ on the misclassification rate. In each simulation, with $m$ ranging
from 1 to 4500, we construct the features annealed independence classifiers
using the training samples, and then apply these classifiers to the 400 test
samples. The classification errors are compared to those of the independence
rule with the oracle ordering information, which is constructed by repeating
the above procedure except that in the first step the features are ordered by
their true signal levels, $|\boldsymbol{\alpha}|$, instead of by their t-statistics.

The above procedure is repeated 100 times, and averages and standard 
errors of the misclassification rates (based on 400 test samples in each sim- 
ulation) are calculated across the 100 simulations. Note that the average of 
the 100 misclassification rates is indeed computed based on 100 x 400 testing 
samples. 

Figure 2 depicts the misclassification rate as a function of the number
of features $m$. The solid curves represent the averages of the misclassification
rates across the 100 simulations, and the corresponding dashed curves are 2
standard errors (i.e., the standard deviation of the 100 misclassification rates
divided by 10) away from the solid ones. The misclassification rates using the
first 80 features in Figure 2(a) are enlarged in Figure 2(b). Figures 2(c) and 2(d)
are the same as 2(a) and 2(b) except that the features are arranged in the
decreasing order of $|\boldsymbol{\alpha}|$, that is, the results are based on the oracle-assisted
features annealed independence classifier. From these plots we see that the
classification results of FAIR are close to those of the oracle-assisted
independence classifier. Moreover, as the dimensionality $m$ grows, the
misclassification rate increases steadily due to the noise accumulation. When all
the features are included, that is, $m = 4500$, the misclassification rate is 0.2522,
whereas the minimum classification errors are 0.0128 in plot 2(b) and 0.0020
in plot 2(d). These results are consistent with Theorem 1. We also tried to
decrease the signal levels, that is, the mean of the double exponential
distribution, or to increase the dimensionality $p$, and found that the classification
error tends to 0.5 when all the dimensions are included. Comparing
Figures 2(a) and 2(b) to Figures 2(c) and 2(d), we see that the features ordered
by t-statistics have higher misclassification rates than those ordered by the
oracle. Also, using t-statistics results in larger minimum classification errors
[see plots 2(b) and 2(d)], but the differences are not very large.

Fig. 1. True mean difference vector $\boldsymbol{\alpha}$. The x-axis represents the dimensionality, and
the y-axis shows the values of the corresponding entries of $\boldsymbol{\alpha}$.

Fig. 2. Number of features versus misclassification rates. The solid curves represent the
averages of classification errors across 100 simulations. The dashed curves are 2 standard
errors away from the solid curves. The x-axis represents the number of features used in the
classification, and the y-axis shows the misclassification rates. (a) The features are ordered
in a way such that the corresponding t-statistics are decreasing in absolute values. (b) The
amplified plot of the first 80 values of the x-axis in plot (a). (c) The same as (a) except that
the features are arranged in a way such that the corresponding true mean differences are
decreasing in absolute values. (d) The amplified plot of the first 80 values of the x-axis in
plot (c).

Fig. 3. Classification errors of the independence rule based on projected samples onto
randomly chosen directions over 100 simulations.

Figure 3 shows the classification errors of the independence rule based on 
projected samples onto randomly chosen directions across 100 simulations. 
Specifically, in each of the simulations in Figure 2, we generate a direction 
vector $\mathbf{a}$ randomly from the $(p - 1)$-dimensional unit sphere, then project
all the data in that simulation onto the direction a, and finally apply the 
Fisher discriminant to the projected data [see (2.3)]. The average of these 
misclassification rates is 0.4986 and the corresponding standard deviation is 
0.0318. These results are consistent with our Theorem 2. 

Finally, we examine the effectiveness of our proposed method (4.3) for 
selecting features in FAIR. In each of the 100 simulations, we apply (4.3) to 
choose the number of features and compute the resulting misclassification 
rate based on 400 test samples. We also use the nearest shrunken centroids 
of Tibshirani et al. (2002) to select the important features. Figure 4 sum- 
marizes these results. The thin curves correspond to the nearest shrunken 
centroids method, and the thick curves correspond to FAIR. Figure 4(a) 
presents the number of features calculated from these two methods, and 
Figure 4(b) shows the corresponding misclassification rates. For our newly 
proposed classifier FAIR, the average of the optimal number of features 
over 100 simulations is 29.71, which is very close to the smallest number of 
features with the minimum misclassification rate in Figure 2(d). The mis- 
classification rates of FAIR in Figure 4(b) have average 0.0154 and standard 
deviation 0.0085, indicating the outstanding performance of FAIR. The nearest
shrunken centroids method is unstable in selecting features. Over the 100
simulations, there are several realizations in which it chooses a very large
number of features. We truncated Figure 4 to make it easier to view. The
average number of features chosen by the nearest shrunken centroids is 28.43,
and the average classification error is 0.0216, with corresponding standard
deviation 0.0179. On average, the nearest shrunken centroids method chooses
slightly fewer features than FAIR, but its misclassification rates are larger.

Fig. 4. The thick curves correspond to FAIR, while the thin curves correspond to the
nearest shrunken centroids method. (a) The numbers of features chosen by (4.3) and by the
nearest shrunken centroids method over 100 simulations. (b) Corresponding classification
errors based on the optimal number of features chosen in (a) over 100 simulations.

5.2. Real data analysis. 

5.2.1. Leukemia data. Leukemia data from high-density Affymetrix 
oligonucleotide arrays were previously analyzed in Golub et al. (1999), and 
are available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi.
There are 7129 genes and 72 samples coming from two classes: 47 in class 
ALL (acute lymphocytic leukemia) and 25 in class AML (acute myelogenous
leukemia). Among these 72 samples, 38 (27 in class ALL and 11 in class 
AML) are set to be training samples and 34 (20 in class ALL and 14 in class 
AML) are set as test samples. 

Before classification, we standardize each sample to zero mean and unit 
variance as done by Dudoit, Fridlyand and Speed (2002). The classification 
results from the nearest shrunken centroids (NSC hereafter) method and
FAIR are shown in Table 1. The nearest shrunken centroids method picks 
up 21 genes and makes 1 training error and 3 test errors, while our method 
chooses 11 genes and makes 1 training error and 1 test error. Tibshirani 
et al. (2002) proposed and applied the nearest shrunken centroids method 
to the unstandardized Leukemia dataset. They chose 21 genes and made 1 
training error and 2 test errors. Our results are still superior to theirs. 
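
The per-sample standardization used before classification can be sketched as below. The use of the sample standard deviation (divisor n - 1) is an assumption; the text only states that each sample is rescaled to zero mean and unit variance. The two short rows in `raw` are made-up numbers, not gene expression data.

```python
import statistics


def standardize_sample(expr):
    """Rescale one sample's expression values to mean 0 and variance 1."""
    mean = statistics.mean(expr)
    sd = statistics.stdev(expr)  # sample standard deviation (n - 1 divisor)
    return [(v - mean) / sd for v in expr]


def standardize_dataset(samples):
    """Apply the per-sample standardization to every row (one row per sample)."""
    return [standardize_sample(row) for row in samples]


raw = [[2.0, 4.0, 6.0, 8.0], [10.0, 10.5, 9.5, 10.0]]  # hypothetical values
scaled = standardize_dataset(raw)
```

Note that the standardization acts on each sample (row) across genes, not on each gene across samples.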

To further evaluate the performance of the two classifiers, we randomly
split the 72 samples into training and test sets. Specifically, we set
approximately 100γ% of the observations from class ALL and 100γ% of the
observations from class AML as training samples, and the rest as test samples.
FAIR and NSC are applied to the training data, and their performances are
evaluated by the test samples. The above procedure is repeated 100 times
for γ = 0.4, 0.5 and 0.6, respectively, and the distributions of test errors of
FAIR, NSC and the independence rule without feature selection are
summarized in Figure 5. In each of the splits, we also calculated the difference
of test errors between FAIR and NSC, that is, the test error of FAIR
minus that of NSC, and the distribution is summarized in Figure 5. The top
panel of Figure 6 shows the number of features selected by FAIR and NSC
for γ = 0.4. The results for the other two values of γ are similar, so we
omit them to save space. From these figures we can see that the
performance of the independence rule improves significantly after feature
selection. The classification errors of NSC and FAIR are approximately the same.
As we have already noticed in the simulation study, NSC does not select
features reliably: the number of features selected by NSC is very
large and unstable, while the number of features selected by FAIR is quite
reasonable and stable over different random splits. Clearly, the independence
rule without feature selection performs poorly.
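
The random splitting scheme can be sketched as follows. The function name `stratified_split` and the use of `round` to turn "approximately 100γ%" into an integer count are assumptions made for illustration; the class sizes are those of the Leukemia data.

```python
import random


def stratified_split(labels, gamma, rng):
    """Split indices so that about 100*gamma% of each class is training data."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(gamma * len(idx))  # assumed rounding convention
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return sorted(train), sorted(test)


labels = ["ALL"] * 47 + ["AML"] * 25  # class sizes of the Leukemia data
rng = random.Random(0)
train_idx, test_idx = stratified_split(labels, 0.4, rng)
print(len(train_idx), len(test_idx))
```

Repeating the call 100 times with fresh random states, and for γ = 0.4, 0.5 and 0.6, would give splits of the kind summarized in Figure 5.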

5.2.2. Lung cancer data. We evaluate our method by classifying between 
malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of 
the lung. Lung cancer data were analyzed by Gordon et al. (2002) and are 
available at http://www.chestsurg.org. There are 181 tissue samples (31 
MPM and 150 ADCA). The training set contains 32 of them, with 16 from
MPM and 16 from ADCA. The remaining 149 samples are used for testing (15
from MPM and 134 from ADCA). Each sample is described by 12533 genes.

Table 1

Classification errors of Leukemia dataset

Method                       Training error   Test error   No. of selected genes

Nearest shrunken centroids   1/38             3/34         21

FAIR                         1/38             1/34         11

Fig. 5. Leukemia data. Boxplots of test errors of FAIR, NSC and the independence rule
without feature selection over 100 random splits of 72 samples, where 100γ% of the samples
from both classes are set as training samples. The three plots from left to right correspond
to γ = 0.4, 0.5 and 0.6, respectively. In each boxplot above, "FAIR" refers to the test errors
of the features annealed independence rule; "NSC" corresponds to the test errors of the nearest
shrunken centroids method; "diff." means the difference of the test errors of FAIR and
those of NSC; and "IR" corresponds to the test errors of the independence rule without feature
selection.

As in the Leukemia dataset, we first standardize the data to zero mean 
and unit variance, and then apply the two classification methods to the stan- 
dardized dataset. Classification results are summarized in Table 2. Although 
FAIR uses 5 more genes than the nearest shrunken centroids method, it 
has better classification results: both methods perfectly classify the training 
samples, while our classification procedure has smaller test error. 

We follow the same procedure as in the Leukemia example to randomly
split the 181 samples into training and test sets. FAIR and NSC are applied
to the training data, and the test errors are calculated using the test data.
The procedure is repeated 100 times with γ = 0.4, 0.5 and 0.6, respectively,
and the test error distributions of FAIR, NSC and the independence rule
without feature selection can be found in Figure 7. We also present the
difference of the test errors between FAIR and NSC in Figure 7. The numbers
of features used by FAIR and NSC with γ = 0.5 are shown in the middle
panel of Figure 6. Figure 7 shows again that feature selection is very
important in high-dimensional classification. The performance of FAIR is close to
that of NSC in terms of classification error (Figure 7), but FAIR is stable in
feature selection, as shown in the middle panel of Figure 6. One possible
explanation for Figure 7 is that the signal strength in this Lung cancer dataset is
relatively weak, and more features are needed to obtain the optimal
performance. However, the estimate of the largest eigenvalue is no longer accurate
when the number of features is large, which results in inaccurate
estimates of $m_1$ in (4.3).

Fig. 6. Leukemia, Lung cancer and Prostate datasets. The number of features selected
by FAIR and NSC over 100 random splits of the total samples. In each split, 100γ% of the
samples from both classes are set as training samples, and the rest are used as test samples.
The three plots from top to bottom correspond to the Leukemia data with γ = 0.4, the Lung
cancer data with γ = 0.5 and the Prostate cancer data with γ = 0.6, respectively. The thin
curves show the results from NSC, and the thick curves correspond to FAIR. The plots are
truncated to make them easy to view.

5.2.3. Prostate cancer data. The last example uses the prostate cancer
data studied in Singh et al. (2002). The dataset is available at
http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. The training
dataset contains 102 patient samples, 52 of which (labeled as "tumor")
are prostate tumor samples and 50 of which (labeled as "Normal") are
normal prostate samples. There are around 12600 genes. An independent set of
test samples is from a different experiment and has 25 tumor and 9 normal
samples.

Table 2

Classification errors of Lung cancer data

Method                       Training error   Test error   No. of selected genes

Nearest shrunken centroids   0/32             11/149       26

FAIR                         0/32             7/149        31

Table 3

Classification errors of Prostate cancer dataset

Method                       Training error   Test error   No. of selected genes

Nearest shrunken centroids   8/102            9/34         6

FAIR                         10/102           9/34         2
Fig. 7. Lung cancer data. The same as Figure 5 except that the dataset is different.

We preprocess the data by standardizing the gene expression data as
before. The classification results are summarized in Table 3. Our method
makes the same number of test errors as the nearest shrunken centroids
method and slightly more training errors, but it uses far fewer genes.

The samples are randomly split into training and test sets in the same
way as before, the test errors are calculated, and the numbers of features
used by these two methods are recorded. Figure 8 shows the test errors of
FAIR, NSC and the independence rule without feature selection, and the
difference of the test errors of FAIR and NSC. The bottom panel of Figure 6
presents the numbers of features used by FAIR and NSC in each random
split for γ = 0.6. As mentioned before, the plots for γ = 0.4 and 0.5 are
similar, so we omit them. The performance of FAIR is better
than that of NSC both in terms of classification error and in terms of the
selection of features. The good performance of FAIR might be caused by
the strong signal level of a few features in this dataset. Due to the strong
signal level, FAIR can attain the optimal performance with a small number of
features. Thus, the estimate of $m_1$ in (4.3) is accurate and hence the actual
performance of FAIR is good.

6. Conclusion. This paper studies the impact of high dimensionality on
classification. To illustrate the idea, we have considered the independence
classification rule, which avoids the difficulty of estimating a large covariance
matrix and the diverging condition number frequently associated with it.
When only a subset of the features captures the
characteristics of the two groups, classification using all dimensions would
intrinsically classify the noise. We prove that classification based on linear
projections onto almost all directions performs nearly the same as random
guessing. Hence, it is necessary to choose direction vectors which put more
weight on important features.

Fig. 8. Prostate cancer data. The same as Figure 5 except that the dataset is different.

The two-sample t-test can be used to choose the important features. We
have shown that under mild conditions, the two-sample t-test can select all
the important features with probability tending to one. The features annealed
independence rule using hard thresholding, FAIR, is proposed, with the number of
features selected by a data-driven rule. An upper bound of the classification 
error of FAIR is explicitly given. We also give suggestions on the optimal 
number of features used in classification. Simulation studies and real data 
analysis support our theoretical results convincingly. 

APPENDIX 



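Proof of Theorem 1. For $\theta \in \Gamma$, $\Psi$ defined in (2.2) can be bounded as

$$\Psi \ge \frac{(\mu_1 - \hat{\mu})'\hat{D}^{-1}(\hat{\mu}_1 - \hat{\mu}_2)}{\sqrt{b_0\,(\hat{\mu}_1 - \hat{\mu}_2)'\hat{D}^{-1} D \hat{D}^{-1}(\hat{\mu}_1 - \hat{\mu}_2)}}, \tag{A.1}$$

where we have used the assumption that $\lambda_{\max}(R) \le b_0$. Denote by

$$\tilde{\Psi} = \frac{(\mu_1 - \hat{\mu})'\hat{D}^{-1}(\hat{\mu}_1 - \hat{\mu}_2)}{\sqrt{(\hat{\mu}_1 - \hat{\mu}_2)'\hat{D}^{-1} D \hat{D}^{-1}(\hat{\mu}_1 - \hat{\mu}_2)}}.$$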



We next study the asymptotic behavior of $\tilde{\Psi}$.

Since Condition 1(b) in Section 3 is satisfied automatically for normal
distribution, by Lemma A.2 below we have $\hat{D} = D(1 + o_P(1))$, where $o_P(1)$
holds uniformly across all diagonal elements. Thus, the right-hand side of
(A.1) can be written as

$$\frac{1}{\sqrt{b_0}}\,\frac{(\mu_1 - \hat{\mu})'D^{-1}(\hat{\mu}_1 - \hat{\mu}_2)}{\sqrt{(\hat{\mu}_1 - \hat{\mu}_2)'D^{-1}(\hat{\mu}_1 - \hat{\mu}_2)}}\,(1 + o_P(1)).$$



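We first consider the denominator. Notice that it can be decomposed as

$$(\hat{\mu}_1 - \hat{\mu}_2)'D^{-1}(\hat{\mu}_1 - \hat{\mu}_2) = \sum_{j=1}^p \frac{\alpha_j^2}{\sigma_j^2} + 2\sum_{j=1}^p \frac{\alpha_j(\bar{\varepsilon}_{1j} - \bar{\varepsilon}_{2j})}{\sigma_j^2} + \sum_{j=1}^p \frac{(\bar{\varepsilon}_{1j} - \bar{\varepsilon}_{2j})^2}{\sigma_j^2} \equiv \alpha'D^{-1}\alpha + 2(1 + o_P(1))I_1 + I_2, \tag{A.2}$$

where $\sigma_j^2$ is the jth diagonal entry of $D$, $\hat{\sigma}_j^2$ is the jth diagonal entry of
$\hat{D}$, and $\bar{\varepsilon}_{kj} = \sum_{i=1}^{n_k} \varepsilon_{kij}/n_k$, k = 1, 2. Notice that $\bar{\varepsilon}_1 - \bar{\varepsilon}_2 \sim N(0, \frac{n}{n_1 n_2}\Sigma)$. By
singular value decomposition we have

$$R = Q_R V_R Q_R',$$

where $Q_R$ is an orthogonal matrix and $V_R = \mathrm{diag}\{\lambda_{R,1}, \ldots, \lambda_{R,p}\}$ contains the
eigenvalues of the correlation matrix $R$. Define $\tilde{\varepsilon} = \sqrt{n_1 n_2/n}\, V_R^{-1/2} Q_R' D^{-1/2}(\bar{\varepsilon}_1 - \bar{\varepsilon}_2)$,
then $\tilde{\varepsilon} \sim N(0, I)$. Hence,

$$I_2 = (\bar{\varepsilon}_1 - \bar{\varepsilon}_2)'D^{-1}(\bar{\varepsilon}_1 - \bar{\varepsilon}_2) = \frac{n}{n_1 n_2}\,\tilde{\varepsilon}' V_R \tilde{\varepsilon}.$$

Since $\sum_{i=1}^p \lambda_{R,i} = p$ and $\lambda_{R,i} \ge 0$ for all i = 1, ..., p, we have $\frac{1}{p^2}\sum_{i=1}^p \lambda_{R,i}^2 < \infty$.
By the weak law of large numbers we have

$$n_1 n_2 I_2/[pn] \xrightarrow{P} 1 \quad \text{as } n \to \infty,\ p \to \infty. \tag{A.3}$$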

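Next, we consider $I_1$. Note that $I_1$ has the distribution $I_1 \sim N(0, \frac{n}{n_1 n_2}\alpha'D^{-1}\Sigma D^{-1}\alpha)$.
Since $\lambda_{\max}(R) \le b_0$, $n\alpha'D^{-1}\alpha \ge nC_p \to \infty$ and

$$\alpha'D^{-1}\Sigma D^{-1}\alpha = \alpha'D^{-1/2} R D^{-1/2}\alpha \le \lambda_{\max}(R)\,\alpha'D^{-1}\alpha,$$

we have $I_1 = \alpha'D^{-1}\alpha\, o_P(1)$. This together with (A.2) and (A.3) yields

$$\frac{n_1 n_2}{pn}(\hat{\mu}_1 - \hat{\mu}_2)'D^{-1}(\hat{\mu}_1 - \hat{\mu}_2) = 1 + \frac{n_1 n_2}{pn}\sum_{j=1}^p \frac{\alpha_j^2}{\sigma_j^2}(1 + o_P(1)). \tag{A.4}$$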

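Now, we consider the numerator. It can be decomposed as

$$\begin{aligned}
(\mu_1 - \hat{\mu})'\hat{D}^{-1}(\hat{\mu}_1 - \hat{\mu}_2)
&= \frac{1}{2}\alpha'\hat{D}^{-1}\alpha - \sum_{j=1}^p \frac{\alpha_j \bar{\varepsilon}_{2j}}{\hat{\sigma}_j^2} - \frac{1}{2}(1 + o_P(1))\sum_{j=1}^p \frac{\bar{\varepsilon}_{1j}^2}{\sigma_j^2} + \frac{1}{2}(1 + o_P(1))\sum_{j=1}^p \frac{\bar{\varepsilon}_{2j}^2}{\sigma_j^2} \\
&\equiv \frac{1}{2}\alpha'D^{-1}\alpha(1 + o_P(1)) - \tilde{I}_3 - \frac{1}{2}(1 + o_P(1))I_4 + \frac{1}{2}(1 + o_P(1))I_5.
\end{aligned}$$

Denote by $I_3 = \sum_{j=1}^p \alpha_j \bar{\varepsilon}_{2j}/\sigma_j^2$. Note that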
(A.5)



op(l) max 



Define Fj = ^/fi2^e 2 j/cx'D ex., then a F . = vax(Fj) < 1 for all j. For the 
normal distribution, we have the following tail probability inequality:

$$1 - \Phi(x) \le \frac{1}{\sqrt{2\pi}}\,\frac{1}{x}\,e^{-x^2/2}.$$

Since $F_j \sim N(0, \sigma_{F_j}^2)$, by the above inequality we have

$$P(|F_j| \ge x) \le 2\exp\left\{-\frac{x^2}{2C}\right\}$$

with $C$ some positive constant, for all $x > 0$ and j = 1, ..., p. By Lemma
2.2.10 of van der Vaart and Wellner [(1996), page 102], we have



(c»'T>~ 1 a)~ 1 E max 



n^EmaxlFA < K\ C\og(p + l)/n 2 — >0, 



where $K$ is some universal constant. This together with (A.5) ensures that



(a'D 1 a) 1 max 



0\ 



(%) - xjt^j 



o"7 



OP (l). 



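Hence,

$$\tilde{I}_3 = I_3 + \alpha'D^{-1}\alpha\, o_P(1). \tag{A.6}$$

Now we only need to consider $I_3$. Note that $I_3 = \sum_{j=1}^p \alpha_j \bar{\varepsilon}_{2j}/\sigma_j^2 \sim N(0, \frac{1}{n_2}\alpha'D^{-1}\Sigma D^{-1}\alpha)$.
Since the variance term can be bounded as

$$\alpha'D^{-1}\Sigma D^{-1}\alpha \le \lambda_{\max}(R)\,\alpha'D^{-1}\alpha,$$

by the assumption that $n\alpha'D^{-1}\alpha \to \infty$ and $\lambda_{\max}(R)$ is bounded, we have
$I_3 = \frac{1}{2}\alpha'D^{-1}\alpha\, o_P(1)$. Combining this with (A.6) leads to

$$\tilde{I}_3 = \frac{1}{2}\alpha'D^{-1}\alpha\, o_P(1).$$

We now examine $I_4$ and $I_5$. By the similar proof to (A.3) above we have

$$I_4 = p/n_1 + o_P\bigl(\sqrt{np/(n_1 n_2)}\bigr) \quad \text{and} \quad I_5 = p/n_2 + o_P\bigl(\sqrt{np/(n_1 n_2)}\bigr).$$

Thus the numerator can be written as

$$(\mu_1 - \hat{\mu})'\hat{D}^{-1}(\hat{\mu}_1 - \hat{\mu}_2) = (1 + o_P(1))\frac{1}{2}\sum_{j=1}^p \alpha_j^2/\sigma_j^2 - (p/n_1 - p/n_2)/2 + o_P\bigl(\sqrt{np/(n_1 n_2)}\bigr)$$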
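and by (A.4)

$$\tilde{\Psi} = \frac{\sqrt{n_1 n_2/(pn)}\,\sum_{j=1}^p \alpha_j^2/\sigma_j^2\,(1 + o_P(1)) + \sqrt{p/(n n_1 n_2)}\,(n_1 - n_2)}{2\{1 + (n_1 n_2/(pn))\sum_{j=1}^p \alpha_j^2/\sigma_j^2\,(1 + o_P(1))\}^{1/2}}.$$

Since the right-hand side is an increasing function of $\alpha'D^{-1}\alpha$ and $\alpha'D^{-1}\alpha \ge C_p$, in view of (A.1)
and the definition of the parameter space $\Gamma$, we have

$$W(\hat{\delta}) = 1 - \Phi\left(\frac{[n_1 n_2/(pn)]^{1/2} C_p\{1 + o_P(1) + p(n_1 - n_2)/(n_1 n_2 C_p)\}}{2\sqrt{b_0}\,\{1 + n_1 n_2/(pn)\, C_p(1 + o_P(1))\}^{1/2}}\right).$$

If $p/(nC_p) \to 0$, then $W(\hat{\delta}) = 1 - \Phi(\frac{1}{2}[n_1 n_2/(pn b_0)]^{1/2} C_p\{1 + o_P(1)\})$.
Furthermore, if $\{n_1 n_2/(pn)\}^{1/2} C_p \to C_0$ with $C_0$ some constant, then

$$W(\hat{\delta}) \xrightarrow{P} 1 - \Phi\left(\frac{C_0}{2\sqrt{b_0}}\right).$$

This completes the proof. □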
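
The Gaussian tail bound used in the proof above, $1 - \Phi(x) \le (2\pi)^{-1/2} x^{-1} e^{-x^2/2}$ for $x > 0$, is easy to check numerically. The sketch below is only a sanity check on a grid of values; the function names are hypothetical and the grid is an arbitrary choice.

```python
import math


def normal_upper_tail(x):
    """1 - Phi(x) for the standard normal, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mills_bound(x):
    """Upper bound exp(-x^2 / 2) / (x * sqrt(2 * pi)), valid for x > 0."""
    return math.exp(-0.5 * x * x) / (x * math.sqrt(2.0 * math.pi))


grid = [0.25 * k for k in range(1, 33)]  # x = 0.25, 0.5, ..., 8.0
ratios = [normal_upper_tail(x) / mills_bound(x) for x in grid]
print(min(ratios), max(ratios))
```

The ratio of the exact tail to the bound stays below one and approaches one as x grows, so the bound is sharp exactly in the large-x regime where the proofs apply it.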



Proof of Theorem 2. Suppose we have a new observation $X$ from
class $C_1$. Then the posterior classification error of using $\hat{\delta}_a(\cdot)$ is

$$W(\hat{\delta}_a, \theta) = E_a[P(\hat{\delta}_a(X) < 0 \mid Y_{ki}, i = 1, \ldots, n_k, k = 1, 2, a)] = 1 - E_a \Phi(\Psi_a\, \mathrm{sign}(a'\hat{\mu}_1 - a'\hat{\mu}_2)),$$

where $\Psi_a = \frac{a'\mu_1 - a'\hat{\mu}}{\sqrt{a'\Sigma a}}$, $\Phi(\cdot)$ is the standard Gaussian distribution function,
and $E_a$ means expectation taken with respect to $a$. We are going to show
that

$$\Psi_a \xrightarrow{P} 0, \tag{A.7}$$

which together with the continuity of $\Phi(\cdot)$ and the dominated convergence
theorem gives

$$\lim E_a \Phi(\Psi_a\, \mathrm{sign}(a'\hat{\mu}_1 - a'\hat{\mu}_2)) = 1/2.$$

Therefore, the posterior error $W(\hat{\delta}_a, \theta)$ is no better than the random
guessing.

Now, let us prove (A.7). Note that the random vector $a$ can be written
as

$$a = Z/\|Z\|,$$

where $Z$ is a p-dimensional standard Gaussian distributed random vector,
independent of all the observations $Y_{ki}$ and $X$. Therefore,

$$\Psi_a = \frac{a'\mu_1 - a'\hat{\mu}}{\sqrt{a'\Sigma a}} = \frac{Z'\alpha/\sqrt{p} - \sqrt{n/(n_1 n_2 p)}\, Z'[\sqrt{n_1 n_2/n}\,(\bar{\varepsilon}_1 + \bar{\varepsilon}_2)]}{2\sqrt{Z'\Sigma Z/p}}, \tag{A.8}$$

where $\alpha = \mu_1 - \mu_2$ and $\bar{\varepsilon}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} \varepsilon_{ki}$, k = 1, 2. By the singular value
decomposition we have

$$\Sigma = Q'VQ,$$

where $Q$ is an orthogonal matrix and $V = \mathrm{diag}\{\lambda_1, \ldots, \lambda_p\}$ is a diagonal
matrix. Let $\tilde{Z} = QZ$, then $\tilde{Z}$ is also a p-dimensional standard Gaussian
random vector. Hence the denominator of $\Psi_a$ can be written as

$$2\sqrt{Z'\Sigma Z/p} = 2\left(\frac{1}{p}\sum_{j=1}^p \lambda_j \tilde{Z}_j^2\right)^{1/2},$$

where $\tilde{Z}_j$ is the jth entry of $\tilde{Z}$. Since it is assumed that $\lim_p \frac{1}{p^2}\sum_{j=1}^p \lambda_j^2 < \infty$
and $\lim_p \frac{1}{p}\sum_{j=1}^p \lambda_j = \tau$ for some positive constant $\tau$, by the weak law of large
numbers, we have

$$\frac{1}{p}\sum_{j=1}^p \lambda_j \tilde{Z}_j^2 \xrightarrow{P} \tau. \tag{A.9}$$

Next, we study the numerator of $\Psi_a$ in (A.8). Since $\frac{1}{p}\sum_{j=1}^p \alpha_j^2 \to 0$, the
first term of the numerator converges to 0 in probability, that is,

$$Z'\alpha/\sqrt{p} \xrightarrow{P} 0. \tag{A.10}$$

Let $\tilde{\varepsilon} = \sqrt{n_1 n_2/n}\,(\bar{\varepsilon}_1 + \bar{\varepsilon}_2)$ and $\xi = V^{-1/2} Q \tilde{\varepsilon}$, then $\xi$ has distribution $N(0, I)$
and is independent of $\tilde{Z}$. The second term of the numerator can be written
as

$$Z'[\sqrt{n_1 n_2/n}\,(\bar{\varepsilon}_1 + \bar{\varepsilon}_2)] = \tilde{Z}'V^{1/2}\xi = \sum_{j=1}^p \sqrt{\lambda_j}\,\tilde{Z}_j \xi_j.$$

Since $\frac{1}{p}\sum_{j=1}^p \lambda_j \to \tau < \infty$, it follows from the weak law of large numbers
that

$$\sqrt{\frac{n}{n_1 n_2 p}}\,\sum_{j=1}^p \sqrt{\lambda_j}\,\tilde{Z}_j \xi_j \xrightarrow{P} 0.$$

This together with (A.8), (A.9) and (A.10) completes the proof. □
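
The two limits for the numerator in the proof above can be made explicit by a direct variance computation. The display below is a sketch, writing $\tilde{Z} = QZ$ and $\xi$ for the two independent standard Gaussian vectors of the proof; the appeal to Chebyshev's inequality is a gloss added here and is not stated in the original argument.

```latex
% Both terms have mean zero, so it suffices to show that their variances vanish.
\operatorname{var}\!\left(\frac{Z'\alpha}{\sqrt{p}}\right)
   = \frac{1}{p}\sum_{j=1}^{p}\alpha_j^{2} \;\to\; 0,
\qquad
\operatorname{var}\!\left(\sqrt{\frac{n}{n_1 n_2 p}}\,
      \sum_{j=1}^{p}\sqrt{\lambda_j}\,\tilde{Z}_j\,\xi_j\right)
   = \frac{n}{n_1 n_2}\cdot\frac{1}{p}\sum_{j=1}^{p}\lambda_j \;\to\; 0 .
% The second identity uses that the products \tilde{Z}_j \xi_j are uncorrelated
% with mean 0 and variance 1; the limit holds because n/(n_1 n_2) -> 0 while
% p^{-1} \sum_j \lambda_j -> \tau < \infty.
```

By Chebyshev's inequality, a mean-zero sequence whose variance tends to zero converges to zero in probability, which gives both limits.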
We need the following two lemmas to prove Theorem 3. 

Lemma A.1 [Cao (2007)]. Let $n = n_1 + n_2$. Assume that there exist
$0 < c_1 \le c_2 < 1$ such that $c_1 \le n_1/n_2 \le c_2$. Let
$\tilde{T}_j = T_j - \frac{\mu_{j1} - \mu_{j2}}{\sqrt{S_{1j}^2/n_1 + S_{2j}^2/n_2}}$. Then
for any $x \equiv x(n_1, n_2)$ satisfying $x \to \infty$ and $x = o(n^{1/2})$,

$$\log P(\tilde{T}_j \ge x) \sim -x^2/2 \quad \text{as } n_1, n_2 \to \infty.$$

If, in addition, we have only $E|Y_{1ij}|^3 < \infty$ and $E|Y_{2ij}|^3 < \infty$, then

$$\frac{P(\tilde{T}_j \ge x)}{1 - \Phi(x)} = 1 + O(1)(1 + x)^3 n^{-1/2} d^3 \quad \text{for } 0 \le x \le n^{1/6}/d,$$

where $d = (E|Y_{1ij}|^3 + E|Y_{2ij}|^3)/(\mathrm{var}(Y_{1ij}) + \mathrm{var}(Y_{2ij}))^{3/2}$ and $O(1)$ is a finite
constant depending only on $c_1$ and $c_2$. In particular,

$$\frac{P(\tilde{T}_j \ge x)}{1 - \Phi(x)} \to 1$$

uniformly in $x \in (0, o(n^{1/6}))$.



Lemma A.2. Suppose Condition 1(b) holds and $\log p = o(n)$. Let $S_{kj}^2$ be
the sample variance defined in Section 1, and $\sigma_{kj}^2$ be the variance of the jth
feature in class $C_k$. Suppose $\min \sigma_{kj}^2$ is bounded away from 0. Then we have
the following uniform convergence result

$$\max_{k=1,2,\ j=1,\ldots,p} |S_{kj}^2 - \sigma_{kj}^2| \xrightarrow{P} 0.$$



Proof. For any $\varepsilon > 0$, we know that when $n_k$ is very large,



P 



max \S kj — a kj \ > e 
fc=l,2, j=i,...,p 



< E E p (l^-^l> £ ) 

fc=i,2i=i 



(A.ll) 



<EE^ 

fc=l,2j'=l V 

+ E E p 

fc=l,2j=l 
= /l+/ 2 . 
E 6 kij
i=l 



> n k y/e 2 



It follows from Bernstein's inequality that 
Since $\log p = o(n)$, we have $I_1 = o(1)$ and $I_2 = o(1)$. These together with
(A.11) complete the proof of Lemma A.2. □

Proof of Theorem 3. We divide the proof into two parts. (a) Let us
first look at the probability $P(\max_{j>s} |T_j| > x)$. Clearly,

$$P\Bigl(\max_{j>s} |T_j| > x\Bigr) \le \sum_{j>s} P(|T_j| > x). \tag{A.12}$$

Note that for all $j > s$, $\alpha_j = \mu_{j1} - \mu_{j2} = 0$. By Condition 1(b) and Lemma A.1,
the following inequality holds for $0 \le x \le n^{1/6}/d$,

$$P(T_j \ge x) \le (1 - \Phi(x))(1 + C(1 + x)^3 n^{-1/2} d^3),$$

where $C$ is a constant that only depends on $c_1$ and $c_2$, and

$$d = (E|Y_{1ij}|^3 + E|Y_{2ij}|^3)/(\sigma_{1j}^2 + \sigma_{2j}^2)^{3/2}$$

with $\sigma_{kj}^2$ the jth diagonal element of $\Sigma_k$. For the normal distribution, we
have the following tail probability inequality

$$1 - \Phi(x) \le \frac{1}{\sqrt{2\pi}}\,\frac{1}{x}\,e^{-x^2/2}.$$

This together with the symmetry of $T_j$ gives

$$P(|T_j| > x) \le \frac{2}{\sqrt{2\pi}}\,\frac{1}{x}\,e^{-x^2/2}\,(1 + C(1 + x)^3 n^{-1/2} d^3).$$

Combining the above inequality with (A.12), we have

$$\sum_{j>s} P(|T_j| > x) \le (p - s)\,\frac{2}{\sqrt{2\pi}}\,\frac{1}{x}\,e^{-x^2/2}\,(1 + C(1 + x)^3 n^{-1/2} d^3).$$

Since $\log(p - s) = o(n^{\gamma})$ with $0 < \gamma < \frac{1}{3}$, if we let $x \sim c n^{\gamma/2}$, then

$$\sum_{j>s} P(|T_j| > x) \to 0,$$

which along with (A.12) yields

$$P\Bigl(\max_{j>s} |T_j| > x\Bigr) \to 0.$$



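(b) Next, we consider $P(\min_{j \le s} |T_j| \le x)$. Notice that for $j \le s$,
$\alpha_j = \mu_{1j} - \mu_{2j} \ne 0$. Let $\eta_j = \alpha_j/\sqrt{S_{1j}^2/n_1 + S_{2j}^2/n_2}$ and define
$\tilde{T}_j = T_j - \eta_j$. Then following the same lines as those in (a), we have

$$\sum_{j \le s} P(|\tilde{T}_j| > x) \le s\,\frac{2}{\sqrt{2\pi}}\,\frac{1}{x}\,e^{-x^2/2}\,(1 + C(1 + x)^3 n^{-1/2} d^3) \to 0.$$

It follows from Lemma A.2 that

$$\max_{j=1,\ldots,s} |S_{kj}^2 - \sigma_{kj}^2| \xrightarrow{P} 0, \quad k = 1, 2.$$

Hence, uniformly over j = 1, ..., s, we have

$$\eta_j = \frac{\alpha_j}{\sqrt{S_{1j}^2/n_1 + S_{2j}^2/n_2}} = \frac{\alpha_j}{\sqrt{\sigma_{1j}^2/n_1 + \sigma_{2j}^2/n_2}}\,(1 + o_P(1)).$$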

Therefore, 



min Tfr = nun v = ===(1 + o P (l)) > min = (l + o P (l)) 



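with $c_2$ defined in Theorem 3. Let $\alpha_0 = \min_{j \le s} |\mu_{j1} - \mu_{j2}|/\sqrt{\sigma_{1j}^2 + \sigma_{2j}^2}$.
Then it follows that

$$P\Bigl(\min_{j \le s} |T_j| \le x\Bigr) \le P\Bigl(\max_{j \le s} |\tilde{T}_j| \ge \min_{j \le s} |\eta_j| - x\Bigr) \le P\Bigl(\max_{j \le s} |\tilde{T}_j| \ge \sqrt{n_1}\,\alpha_0(1 + o_P(1)) - x\Bigr).$$

By part (a), we know that $x \sim c n^{\gamma/2}$ and $\log(p - s) = o(n^{\gamma})$. Thus if
$\alpha_0 = \min_{j \le s} \frac{|\mu_{j1} - \mu_{j2}|}{\sqrt{\sigma_{1j}^2 + \sigma_{2j}^2}} = n^{-\gamma}\beta_n$ for some $\beta_n \to \infty$, then similarly to part (a), we
have

$$P\Bigl(\min_{j \le s} |T_j| \le x\Bigr) \to 0.$$

Combination of part (a) and part (b) completes the proof. □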
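
The union bound in part (a) can be evaluated numerically to see how quickly the chance of selecting a noise feature decays with the threshold x. The sketch below computes only the leading term $(p - s)\,\frac{2}{\sqrt{2\pi}}\,\frac{1}{x}\,e^{-x^2/2}$ and drops the correction factor $1 + C(1 + x)^3 n^{-1/2} d^3$; the values `p=4500` and `s=50` are hypothetical.

```python
import math


def false_selection_bound(p, s, x):
    """Leading term (p - s) * 2 / (sqrt(2*pi) * x) * exp(-x^2 / 2)."""
    return (p - s) * 2.0 / (math.sqrt(2.0 * math.pi) * x) * math.exp(-0.5 * x * x)


bounds = {x: false_selection_bound(p=4500, s=50, x=x) for x in (3.0, 4.0, 5.0, 6.0)}
for x, value in bounds.items():
    print(f"x = {x:.1f}: bound = {value:.3g}")
```

For a small threshold the bound exceeds one and is uninformative, while a moderately larger threshold makes it negligible, which is why x is allowed to grow with n in the proof.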



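Proof of Theorem 4. The classification error of the truncated
classifier $\hat{\delta}_{NC}^m$ is

$$W(\hat{\delta}_{NC}^m, \theta) = 1 - \Phi\left(\frac{\sum_{j=1}^m \hat{\alpha}_j(\mu_{1j} - \hat{\mu}_j)}{\{\sum_{j=1}^m \hat{\alpha}_j^2\}^{1/2}}\right).$$

We first consider the denominator. Note that $\hat{\alpha}_j \sim N(\alpha_j, \frac{n}{n_1 n_2})$. It can be
shown that

$$\left(\frac{4n}{n_1 n_2}\sum_{j=1}^m \alpha_j^2 + \frac{2mn^2}{(n_1 n_2)^2}\right)^{-1/2} \sum_{j=1}^m \left(\hat{\alpha}_j^2 - \alpha_j^2 - \frac{n}{n_1 n_2}\right) \xrightarrow{D} N(0, 1),$$

which together with the assumption $\frac{n}{\sqrt{m}}\sum_{j=1}^m \alpha_j^2 \to \infty$ gives

$$\sum_{j=1}^m \hat{\alpha}_j^2 = (1 + o_P(1))\sum_{j=1}^m \alpha_j^2 + \frac{mn}{n_1 n_2}.$$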

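Next, let us look at the numerator. We decompose it as

$$\sum_{j=1}^m \hat{\alpha}_j(\mu_{1j} - \hat{\mu}_j) = \frac{1}{2}\sum_{j=1}^m \alpha_j^2 - \sum_{j=1}^m \alpha_j \bar{\varepsilon}_{2j} - \frac{1}{2}\sum_{j=1}^m (\bar{\varepsilon}_{1j}^2 - \bar{\varepsilon}_{2j}^2). \tag{A.13}$$

Since the second term above has the distribution $N(0, \sum_{j=1}^m \alpha_j^2/n_2)$, it
follows from the assumption $n\sum_{j=1}^m \alpha_j^2 \to \infty$ that

$$\sum_{j=1}^m \alpha_j \bar{\varepsilon}_{2j} = o_P(1)\sum_{j=1}^m \alpha_j^2.$$

The third term in (A.13) can be written as

$$\sum_{j=1}^m (\bar{\varepsilon}_{1j}^2 - \bar{\varepsilon}_{2j}^2) = \frac{m(n_2 - n_1)}{n_1 n_2} + o_P(1)\sum_{j=1}^m \alpha_j^2.$$

Hence the numerator is

$$\sum_{j=1}^m \hat{\alpha}_j(\mu_{1j} - \hat{\mu}_j) = \frac{1}{2}(1 + o_P(1))\sum_{j=1}^m \alpha_j^2 + \frac{m(n_1 - n_2)}{2 n_1 n_2}.$$

Therefore, the classification error is

$$W(\hat{\delta}_{NC}^m, \theta) = 1 - \Phi\left(\frac{(1 + o_P(1))\sum_{j=1}^m \alpha_j^2 + m(n_1 - n_2)/(n_1 n_2)}{2\{(1 + o_P(1))\sum_{j=1}^m \alpha_j^2 + mn/(n_1 n_2)\}^{1/2}}\right).$$

This concludes the proof. □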
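
The error expression at the end of the proof above makes the trade-off in the number of features m concrete. The sketch below evaluates it with the $o_P(1)$ terms dropped, so it is a leading-order approximation only; the signal vector (five features with $\alpha_j^2 = 1$ followed by pure noise) and the sample sizes are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def truncated_error(alpha_sq, m, n1, n2):
    """Leading-order error using the first m features, o_P(1) terms dropped."""
    n = n1 + n2
    signal = sum(alpha_sq[:m])
    numerator = signal + m * (n1 - n2) / (n1 * n2)
    denominator = 2.0 * math.sqrt(signal + n * m / (n1 * n2))
    return 1.0 - normal_cdf(numerator / denominator)


alpha_sq = [1.0] * 5 + [0.0] * 4495  # hypothetical squared mean differences
errors = {m: truncated_error(alpha_sq, m, 30, 30) for m in (1, 5, 50, 500, 4500)}
best_m = min(errors, key=errors.get)
for m, err in errors.items():
    print(f"m = {m:5d}: approximate error = {err:.3f}")
```

In this example the approximate error falls while informative features are being added and rises again once only noise features enter, since each extra feature adds $n/(n_1 n_2)$ to the denominator without contributing signal.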

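Proof of Theorem 5. Note that the classification error of $\hat{\delta}_{FAIR}$ is

$$W(\hat{\delta}_{FAIR}, \theta) = 1 - \Phi\left(\frac{\sum_j \hat{\alpha}_j(\mu_{1j} - \hat{\mu}_j)1\{|\hat{\alpha}_j| > b_n\}}{\{\sum_j \hat{\alpha}_j^2 1\{|\hat{\alpha}_j| > b_n\}\}^{1/2}}\right) \equiv 1 - \Phi(\Psi).$$

We divide the proof into two parts: the numerator and the denominator.
(a) First, we study the numerator of $\Psi$. It can be decomposed as

$$\sum_j (\mu_{1j} - \hat{\mu}_j)\hat{\alpha}_j 1\{|\hat{\alpha}_j| > b_n\} = I_1 + I_2,$$

where $I_1 = \sum_{j \in A} (\mu_{1j} - \hat{\mu}_j)\hat{\alpha}_j 1\{|\hat{\alpha}_j| > b_n\}$ and $I_2 = \sum_{j \in A^c} (\mu_{1j} - \hat{\mu}_j)\hat{\alpha}_j 1\{|\hat{\alpha}_j| > b_n\}$
with $A^c$ the complement of the set $A$. Note that

$$I_2 = \frac{1}{2}\sum_{j \in A^c} \alpha_j^2 1\{|\hat{\alpha}_j| > b_n\} - \sum_{j \in A^c} \alpha_j \bar{\varepsilon}_{2j} 1\{|\hat{\alpha}_j| > b_n\} - \frac{1}{2}\sum_{j \in A^c} (\bar{\varepsilon}_{1j}^2 - \bar{\varepsilon}_{2j}^2) 1\{|\hat{\alpha}_j| > b_n\} \equiv \frac{1}{2}I_{2,1} - I_{2,2} - \frac{1}{2}I_{2,3}.$$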

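Since $\hat{\alpha}_j \sim N(\alpha_j, \frac{n}{n_1 n_2})$, it follows from the normal tail probability inequality
that for every $j \in A^c$ and $b_n > \max_{j \in A^c} |\alpha_j|$,

$$P(|\hat{\alpha}_j| \ge b_n) \le P\Bigl(|\hat{\alpha}_j - \alpha_j| \ge b_n - \max_{j \in A^c} |\alpha_j|\Bigr) \le M\,\frac{\exp\{-n_1 n_2 (b_n - \max_{j \in A^c} |\alpha_j|)^2/(2n)\}}{\sqrt{n_1 n_2 n^{-1}}\,(b_n - \max_{j \in A^c} |\alpha_j|)}, \tag{A.14}$$

where $M$ is a generic constant. Thus for every $\varepsilon > 0$, if $\log(p - m)/[n(b_n - \max_{j \in A^c} |\alpha_j|)^2] \to 0$
and $\max_{j \in A^c} |\alpha_j| < b_n$, we have

$$P(|I_{2,1}| > \varepsilon) \le \frac{1}{\varepsilon}\sum_{j \in A^c} \alpha_j^2 P(|\hat{\alpha}_j| > b_n) \le M \max_{j \in A^c} \alpha_j^2\,\frac{p - m}{\varepsilon}\,\frac{\exp\{-n_1 n_2 (b_n - \max_{j \in A^c} |\alpha_j|)^2/(2n)\}}{\sqrt{n_1 n_2 n^{-1}}\,(b_n - \max_{j \in A^c} |\alpha_j|)},$$

which tends to zero. Hence,

$$I_{2,1} \xrightarrow{P} 0. \tag{A.15}$$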

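We next consider $I_{2,2}$. Since $E(\bar{\varepsilon}_{2j})^2 = 1/n_2$, $\log(p - m)/[n(b_n - \max_{j \in A^c} |\alpha_j|)^2] \to 0$,
and $\max_{j \in A^c} |\alpha_j| < b_n$, we have

$$\begin{aligned}
P(|I_{2,2}| > \varepsilon) &\le \varepsilon^{-1}\sum_{j \in A^c} E|\alpha_j \bar{\varepsilon}_{2j} 1\{|\hat{\alpha}_j| > b_n\}| \\
&\le \varepsilon^{-1}\sum_{j \in A^c} \{E(\bar{\varepsilon}_{2j})^2\}^{1/2}\{E|\alpha_j^2 1\{|\hat{\alpha}_j| > b_n\}|\}^{1/2} \\
&\le M\,\frac{(p - m)\max_{j \in A^c} |\alpha_j| \exp\{-n_1 n_2 (b_n - \max_{j \in A^c} |\alpha_j|)^2/(4n)\}}{\varepsilon\sqrt{n_2}\,\{\sqrt{n_1 n_2 n^{-1}}\,(b_n - \max_{j \in A^c} |\alpha_j|)\}^{1/2}},
\end{aligned}$$

which converges to 0. Therefore,

$$I_{2,2} \xrightarrow{P} 0. \tag{A.16}$$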
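Then, we consider $I_{2,3}$. Since $c_1 \le n_1/n_2 \le c_2$ and
$E(\bar{\varepsilon}_{1j}^2 - \bar{\varepsilon}_{2j}^2)^2 = \frac{3}{n_1^2} + \frac{3}{n_2^2} - \frac{2}{n_1 n_2} \le \frac{3c_2^2 + 3 - 2c_1}{n_1^2}$, by (A.14) we have for every $\varepsilon > 0$,

$$\begin{aligned}
P(|I_{2,3}| > \varepsilon) &\le \varepsilon^{-1}\sum_{j \in A^c} E|(\bar{\varepsilon}_{1j}^2 - \bar{\varepsilon}_{2j}^2) 1\{|\hat{\alpha}_j| > b_n\}| \\
&\le \varepsilon^{-1}\sum_{j \in A^c} \{E(\bar{\varepsilon}_{1j}^2 - \bar{\varepsilon}_{2j}^2)^2\}^{1/2} P(|\hat{\alpha}_j| > b_n)^{1/2} \\
&\le \frac{M}{\varepsilon n_1}\sum_{j \in A^c} P(|\hat{\alpha}_j| > b_n)^{1/2} \to 0,
\end{aligned}$$

where $M$ is some generic constant. Thus, $I_{2,3} \xrightarrow{P} 0$. Combination of this
with (A.15) and (A.16) entails

$$I_2 = o_P(1).$$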

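We now deal with $I_1$. Decompose $I_1$ similarly as

$$I_1 = \sum_{j \in A} (\mu_{1j} - \hat{\mu}_{1j})\hat{\alpha}_j 1\{|\hat{\alpha}_j| > b_n\} + \frac{1}{2}\sum_{j \in A} \hat{\alpha}_j^2 1\{|\hat{\alpha}_j| > b_n\} \equiv I_{1,1} + \frac{1}{2}I_{1,2}.$$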

We first study $I_{1,2}$. By using $\hat{\alpha}_j \sim N(\alpha_j, \frac{n}{n_1 n_2})$, it can be shown that

Sin« TfeEji,"? - 00, we have (^Ejo,^ + S^VE^ - °- 
Therefore, 

nm 



(l + op(l))E«i + ^fe 2 - 



Next, we look at $I_{1,1}$. For any $\varepsilon > 0$,



JG.4 

1 



ni£ jeA 



E y« 2 + n/(nin 2 ). 
When n is large enough, the above probability can be bounded by



p(|/i,il>£)<V 2 /(^ 2 )EK-l> 

jeA 

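which along with the assumption $\sum_{j \in A} |\alpha_j|/[\sqrt{n}\sum_{j \in A} \alpha_j^2] \to 0$ gives

$$I_{1,1} = o_P(1)\sum_{j \in A} \alpha_j^2.$$

It follows that the numerator is bounded from below by

$$\frac{1}{2}\Bigl\{(1 + o_P(1))\sum_{j \in A} \alpha_j^2 + \frac{mn}{n_1 n_2} - m b_n^2\Bigr\}.$$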
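(b) Now, we study the denominator of $\Psi$. Let

$$\sum_j \hat{\alpha}_j^2 1\{|\hat{\alpha}_j| > b_n\} = \sum_{j \in A} \hat{\alpha}_j^2 1\{|\hat{\alpha}_j| > b_n\} + \sum_{j \in A^c} \hat{\alpha}_j^2 1\{|\hat{\alpha}_j| > b_n\} \equiv J_1 + J_2.$$

We first show that $J_2 \xrightarrow{P} 0$. Note that $E\hat{\alpha}_j^4 = \alpha_j^4 + 6n(n_1 n_2)^{-1}\alpha_j^2 + 3n^2(n_1 n_2)^{-2}$.
Thus,

$$\begin{aligned}
P(|J_2| > \varepsilon) \le \frac{1}{\varepsilon}E|J_2| &= \sum_{j \in A^c} E\hat{\alpha}_j^2 1\{|\hat{\alpha}_j| > b_n\}/\varepsilon \le \frac{1}{\varepsilon}\sum_{j \in A^c} \{E\hat{\alpha}_j^4\, P(|\hat{\alpha}_j| > b_n)\}^{1/2} \\
&\le \frac{1}{\varepsilon}\sum_{j \in A^c} \{(\alpha_j^4 + 6n(n_1 n_2)^{-1}\alpha_j^2 + 3n^2(n_1 n_2)^{-2})\, P(|\hat{\alpha}_j| > b_n)\}^{1/2}.
\end{aligned}$$

This together with (A.14) and the assumption that $\log(p - m)/[n(b_n - \max_{j \in A^c} |\alpha_j|)^2] \to 0$
yields $J_2 \xrightarrow{P} 0$ as $n \to \infty$, $p \to \infty$. Now we study term $J_1$.
By (A.17), we have

$$J_1 \le \sum_{j \in A} \hat{\alpha}_j^2 = (1 + o_P(1))\sum_{j \in A} \alpha_j^2 + \frac{mn}{n_1 n_2}.$$

Hence the denominator is bounded from above by $(1 + o_P(1))\sum_{j \in A} \alpha_j^2 + \frac{mn}{n_1 n_2}$. Therefore,

$$\Psi \ge \frac{(1 + o_P(1))\sum_{j \in A} \alpha_j^2 + (mn/(n_1 n_2)) - m b_n^2}{2\sqrt{(1 + o_P(1))\sum_{j \in A} \alpha_j^2 + (mn/(n_1 n_2))}}.$$

It follows that the classification error is bounded from above by

$$1 - \Phi\left(\frac{(1 + o_P(1))\sum_{j \in A} \alpha_j^2 + (mn/(n_1 n_2)) - m b_n^2}{2\sqrt{(1 + o_P(1))\sum_{j \in A} \alpha_j^2 + (mn/(n_1 n_2))}}\right).$$

This completes the proof. □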